Well, why indeed? I don't even like basketball and I don't follow college sports at all (I am an avid NFL fan, though). One of my friends from UT Austin (so way back in my physics days) posted on Facebook about something called "Coder's Bracket" (which no longer exists, so I haven't worked on this since 2016). You had to write a javascript routine to predict the winner of every match. The site provided methods to get team stats for each of the 64 teams (e.g. `team.getRank()`). So, for instance, one simple routine would be to pick the winner based on ranking (which, obviously, actually works pretty well). I joked on his post that if anyone had all of the stats for previous brackets, I could use machine learning to create an algorithm...I still had a couple of days until the tournament started, so I ended up doing just that.
I would consider the results mixed but positive. I played the "Coder's Bracket" challenge from 2014 to 2016 (three years); after that it was no longer maintained, so I haven't messed with this since (I really should). My first attempt wasn't very successful and I think it ended up somewhere around the 50th percentile. Then again, that first year I threw the whole thing together in about two days. Since you could still submit brackets even after the tournament started, I did, and those did much better (obviously you couldn't enter them into the contest; you just got an idea of your score).
I was actually very successful in 2015. I think I submitted three brackets and all three were above the 80th percentile; I even had one bracket in the top 20 overall (among something like 10,000 entries) until Kentucky blew everyone's bracket by losing to Wisconsin in the Final Four. Overall, I was pretty consistently in the 80th-90th percentile.
So this was probably the biggest challenge: where could I find the following stats for every team in the tournament, going back as far as possible?
Seed (1-16) | Rank (Overall) | Wins | Losses |
---|---|---|---|
Games Played | Winning Percentage | FG Made | FG Attempted |
FG Percentage | FT Made | FT Attempts | FT Percentage |
3-pt Made | 3-pt Attempted | 3-pt Percentage | Total Rebounds |
Offensive Rebounds | Defensive Rebounds | Rebounds Per Game | Steals |
Steals Per Game | Blocks | Blocks Per Game | Assists |
Assists Per Game | Turnovers | Turnovers Per Game | Round (in tournament) |
This data is available from many different websites and is pretty easy to find. For instance, say I want the stats for the 2015 Kentucky Wildcats: I just pick a website and look them up. Now I repeat that 64 times, once for each team, and then again for each year going back as far as I'm willing to.
I ended up using Sports Reference because 1) it had all of the information I needed easily accessible from a single webpage and 2) everything was written in pure HTML (no server calls to dynamically display data) so the data I needed was embedded in the HTML source.
I used `jsoup` to parse the HTML. This was fairly tedious work since I had to manually figure out how to find the data I needed (usually I could find it by an `id` or `class` attribute). However, this was only necessary for a few things.
Here is a code snippet which gives you some idea of how I accomplished this. Reading the brackets themselves is more complex, but the idea is the same (there's just a lot more to look for).
```java
final Document source = Jsoup.parse(new URL(link), 2000);

// select all elements which have 'bracket' set as their class attribute
final Elements bracketBody = source.select(".bracket");

for (Element bracket : bracketBody) {
    // check to see if first child is a tbody element...if so, set
    // bracket to that and proceed
    if (bracket.child(0).tagName().equals("tbody")) {
        bracket = bracket.child(0);
    }

    // check the first tr's first td element...if it has class
    // 'align_right seed', then this is a preliminary bracket, otherwise
    // it's the final four.
    if (bracket.child(0).child(0).className().equals("align_right seed")) {
        readPreliminaryBracket(bracket);
    } else {
        readFinalFour(bracket);
    }
}
```
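Pulling the season stat tables was more straightforward. As a rough illustration of that kind of lookup (the element id and cell class names below are placeholders I made up, not the actual Sports Reference markup, which changed from year to year anyway):

```java
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical sketch: "team_stats", ".fg" and ".fga" are placeholder
// selectors, not the real Sports Reference markup.
static void readTeamStats(String teamLink) throws Exception {
    final Document teamPage = Jsoup.parse(new URL(teamLink), 2000);

    // grab the season-totals table by its id attribute
    final Element totals = teamPage.getElementById("team_stats");
    if (totals == null) {
        return; // placeholder id not found; the markup has changed
    }

    // first data row of the table body
    final Element row = totals.select("tbody tr").first();

    // pull individual cells by their class attribute
    final double fgMade = Double.parseDouble(row.select(".fg").text());
    final double fgAttempted = Double.parseDouble(row.select(".fga").text());
    // ...and so on for the rest of the stats in the table above
}
```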
The first year I did this (2014), Sports Reference only had digital data going back to 1998 (prior to that they only had photo-copied results as images). They were clearly in the process of digitizing their data, because each following year (2015 and then 2016) they had more of it available digitally. Unfortunately, they had also updated their website, so every year I did this (2014-2016) I had to rewrite the HTML scraping code. Luckily, though, it was a one-and-done thing: once it was written, I had all of the data I needed for that year.
I used Weka to process my data because I had previously used Weka in my Machine Learning class. The first thing I did was take the data I gathered above and put it into an ARFF file. I was then able to use Weka manually to process the data for me.
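An ARFF file is just a plain-text header declaring the attributes, followed by the data rows. A minimal sketch of the general shape (only a few of the attributes from the table above, with made-up numbers, and a nominal class for the winner, which I'll get to next):

```
@relation tournament-games

@attribute seed1 numeric
@attribute rank1 numeric
@attribute fgMade1 numeric
@attribute seed2 numeric
@attribute rank2 numeric
@attribute fgMade2 numeric
@attribute winner {team1, team2}

@data
1, 2, 980, 16, 210, 845, team1
8, 40, 912, 9, 35, 903, team2
```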
I saw this as a classification problem. Each data set represents a game: two sets of team data. The task is to classify the data set as either "team 1" or "team 2" (the winner of the game). There is immediately going to be a problem with my training set if I don't take precautions. Theoretically, this determination is symmetric: if I switch which team is `team1` and which is `team2`, the algorithm should also switch its answer. If the data is organized (when I gather it) so that I end up putting all of the winners as `team1`, then the training set will just recognize that `team1` always wins (or almost always wins) and predict that. Instead, before I got to Weka, I randomized the ARFF files so that this is unlikely to happen.
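The randomization is the simple part. A minimal sketch of the idea, assuming each game's stats are kept as two numeric arrays in the same order as the ARFF attributes (the method and parameter names are mine, not from the original code):

```java
import java.util.Random;

// Sketch: 'winnerStats' and 'loserStats' are the stat vectors for the
// two teams in one historical game, in the same order as the ARFF header.
static String toArffRow(double[] winnerStats, double[] loserStats, Random rng) {
    double[] first = winnerStats;
    double[] second = loserStats;
    String label = "team1";

    // flip a coin: half the time the winner is written out as team2,
    // so the class label is roughly balanced across the training set
    if (rng.nextBoolean()) {
        first = loserStats;
        second = winnerStats;
        label = "team2";
    }

    StringBuilder row = new StringBuilder();
    for (double v : first) row.append(v).append(",");
    for (double v : second) row.append(v).append(",");
    return row.append(label).toString();
}
```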
Initially I just put all of the data I had into the ARFF as is. I got data that looked something like this:
This data is somewhat difficult to analyze just looking at it. Look at the data for `seed1` and `seed2`: they are basically mirror images of each other. If `seed1` is high (e.g. seed 1 is the highest seed), then `team1` (blue) is likely to win, and vice versa: if `seed2` is high, `team2` (red) is likely to win.
Thinking about the above data, it's not giving much of a direct comparison. If you think rank is the most important stat, then you would expect something like this:
```javascript
if (rank1 > rank2)
    team1.winsGame();
else
    team2.winsGame();
```
But this doesn't really match how the construction of the J48 tree works. Instead, it's going to create a complicated `case` structure where it finds values for things like `seed1` and `seed2` to split on. With this in mind, I decided it might be beneficial to combine each pair of stats into a single difference, rather than two absolute values. Here is what that data looks like:
I find this data much easier to make sense of than the raw data, because it shows which stats are actually predicting the winner. You can tell a stat is predictive when the red and blue areas are separated, each on one side of 0 (i.e. the point where the two teams' values are equal). Take the following two: the difference in rank (left) and the difference in free throw percentage (right):
You can clearly see the red skews to the right and the blue to the left for the rank difference (suggesting a negative difference means `team1` is more likely to win and a positive difference means `team2` is more likely). However, for the free throw percentage there isn't an apparent trend (they're both pretty well centered about a difference of 0). The free throw percentage isn't very predictive, but if you look at the data above, the number of free throw attempts (and thus also free throws made) does show a predictive trend. This is because free throw percentages are generally high regardless and thus aren't going to be very predictive, whereas the number of free throws attempted shows a team that draws a lot of fouls.
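Building these features is just a per-stat subtraction. A minimal sketch, assuming (consistent with the rank discussion above) that each difference is team 1's value minus team 2's value:

```java
// Sketch: compute difference features as team1 minus team2,
// in the same attribute order used for the raw ARFF file.
static double[] toDiffFeatures(double[] team1Stats, double[] team2Stats) {
    double[] diffs = new double[team1Stats.length];
    for (int i = 0; i < team1Stats.length; i++) {
        // e.g. seedDiff = seed1 - seed2, rankDiff = rank1 - rank2, ...
        diffs[i] = team1Stats[i] - team2Stats[i];
    }
    return diffs;
}
```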
Caveat: the last time I used this was for 2016. So it's been three years since I've gone through this. Thus a lot of this section is "to the best of my recollection".
I had the most success using a J48 tree. I used some others, like JRip, ZeroR, and Bayes algorithms (BayesNet and NaiveBayes), but my best results from the J48 tree were consistently about 5 percentage points more accurate than the best of the others.
The percentage correct was pretty similar (about 80%) across the different J48 parameters, so I found the ROC area and Kappa statistic to be the best indicators of a good run. There were three parameters that I varied when evaluating the J48 tree:
I probably tested more than those ranges, but those are the ones I settled on. As I recall, I spent quite a bit of time experimenting with those numbers to try and narrow in on a range that gave me good results.
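I no longer have the exact parameters or ranges written down, but for context, this is roughly what an evaluation run looks like through Weka's Java API (the class name, file name, confidence factor, and minimum-instances-per-leaf values here are just illustrative, not the settings I actually used):

```java
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class J48Experiment {
    public static void main(String[] args) throws Exception {
        // load the (randomized) ARFF file; the last attribute is the class
        Instances data = new Instances(new FileReader("tournament.arff"));
        data.setClassIndex(data.numAttributes() - 1);

        // illustrative parameter values, not the ranges I actually used
        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f);  // pruning confidence (-C)
        tree.setMinNumObj(10);            // minimum instances per leaf (-M)

        // 10-fold cross-validation, reporting the metrics I watched
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println("Correct: " + eval.pctCorrect() + "%");
        System.out.println("Kappa:   " + eval.kappa());
        System.out.println("ROC:     " + eval.weightedAreaUnderROC());
    }
}
```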
Since I was using a website to generate my bracket from javascript code (given their data and methods), I had to convert the tree generated by Weka to a javascript routine. I actually never generated a bracket myself, so unfortunately I don't have a bracket to show.
Once I settled on parameters for the J48 tree, the output from Weka looked something like this:
```
fgMadeDiff <= 1
|   seedDiff <= 3
|   |   tovPerGameDiff <= -2.713964: team1 (23.77/7.02)
|   |   tovPerGameDiff > -2.713964
|   |   |   ftAttemptsDiff <= -136: team2 (41.13/5.0)
|   |   |   ftAttemptsDiff > -136
|   |   |   |   fgAttemptsDiff <= -105: team2 (116.47/30.48)
|   |   |   |   fgAttemptsDiff > -105
|   |   |   |   |   rebPerGameDiff <= -1.4613: team1 (23.39/2.91)
|   |   |   |   |   rebPerGameDiff > -1.4613
|   |   |   |   |   |   drbDiff <= -38: team2 (19.21/6.09)
|   |   |   |   |   |   drbDiff > -38: team1 (48.04/21.82)
|   seedDiff > 3: team2 (283.0/34.0)
fgMadeDiff > 1
|   seedDiff <= -8: team1 (177.0/8.0)
|   seedDiff > -8
|   |   seedDiff <= 8
|   |   |   fgAttemptsDiff <= 225
|   |   |   |   seedDiff <= 4
|   |   |   |   |   stlDiff <= 66
|   |   |   |   |   |   rebPerGameDiff <= -2.385714: team1 (29.0/1.0)
|   |   |   |   |   |   rebPerGameDiff > -2.385714
|   |   |   |   |   |   |   totRebDiff <= -75: team2 (7.0)
|   |   |   |   |   |   |   totRebDiff > -75: team1 (114.0/39.0)
|   |   |   |   |   stlDiff > 66: team2 (38.0/14.0)
|   |   |   |   seedDiff > 4
|   |   |   |   |   astPerGameDiff <= 1.95
|   |   |   |   |   |   rebPerGameDiff <= 0.970656: team1 (23.0/8.0)
|   |   |   |   |   |   rebPerGameDiff > 0.970656: team2 (10.0/1.0)
|   |   |   |   |   astPerGameDiff > 1.95: team2 (11.0/1.0)
|   |   |   fgAttemptsDiff > 225: team1 (144.0/24.0)
|   |   seedDiff > 8: team2 (26.0/5.0)
```
This is just a giant `if-else` tree, so I just needed to correctly parse it to translate it to javascript (this was automated, and above this block in the generated routine are the definitions for all of the values):
```javascript
if (fgMadeDiff <= 1) {
    if (seedDiff <= 3) {
        if (tovPerGameDiff <= -2.713964) {
            team1.winsGame();
        } else {
            if (ftAttemptsDiff <= -136) {
                team2.winsGame();
            } else {
                if (fgAttemptsDiff <= -105) {
                    team2.winsGame();
                } else {
                    if (rebPerGameDiff <= -1.4613) {
                        team1.winsGame();
                    } else {
                        if (drbDiff <= -38) {
                            team2.winsGame();
                        } else {
                            team1.winsGame();
                        }
                    }
                }
            }
        }
    } else {
        team2.winsGame();
    }
} else {
    if (seedDiff <= -8) {
        team1.winsGame();
    } else {
        if (seedDiff <= 8) {
            if (fgAttemptsDiff <= 225) {
                if (seedDiff <= 4) {
                    if (stlDiff <= 66) {
                        if (rebPerGameDiff <= -2.385714) {
                            team1.winsGame();
                        } else {
                            if (totRebDiff <= -75) {
                                team2.winsGame();
                            } else {
                                team1.winsGame();
                            }
                        }
                    } else {
                        team2.winsGame();
                    }
                } else {
                    if (astPerGameDiff <= 1.95) {
                        if (rebPerGameDiff <= 0.970656) {
                            team1.winsGame();
                        } else {
                            team2.winsGame();
                        }
                    } else {
                        team2.winsGame();
                    }
                }
            } else {
                team1.winsGame();
            }
        } else {
            team2.winsGame();
        }
    }
}
```
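I don't have the original converter handy anymore, so the following is only a rough sketch of how that kind of translation can be automated (the class and method names are mine): count the leading `|   ` markers on each line of the Weka output to get the depth, treat lines containing a `:` as leaves, and emit nested `if`/`else` blocks recursively.

```java
import java.util.List;

/**
 * Rough sketch (not the original code): turn Weka's printed J48 tree
 * into nested javascript if/else statements. Assumes every split is
 * binary, i.e. each "attr <= v" line is followed, at the same depth,
 * by its matching "attr > v" line, and that depth is marked by leading
 * "|   " (pipe plus three spaces) prefixes.
 */
public class TreeToJs {

    private final List<String> lines;
    private int pos = 0;

    public TreeToJs(List<String> lines) {
        this.lines = lines;
    }

    /** Emit the subtree whose first line sits at depth d. */
    public String emit(int d) {
        StringBuilder js = new StringBuilder();
        String indent = "    ".repeat(d);

        // "attr <= v" line; strip the d leading "|   " markers (4 chars each)
        String thenLine = lines.get(pos++).substring(d * 4);
        String condition = thenLine.split(":")[0].trim();

        js.append(indent).append("if (").append(condition).append(") {\n");
        js.append(branchBody(thenLine, d));
        js.append(indent).append("} else {\n");

        // matching "attr > v" line at the same depth
        String elseLine = lines.get(pos++).substring(d * 4);
        js.append(branchBody(elseLine, d));
        js.append(indent).append("}\n");
        return js.toString();
    }

    /** A branch is either a leaf ("...: teamX (n/m)") or a deeper subtree. */
    private String branchBody(String line, int d) {
        if (line.contains(":")) {
            // leaf: the class label ("team1" or "team2") follows the colon
            String winner = line.split(":")[1].trim().split("\\s+")[0];
            return "    ".repeat(d + 1) + winner + ".winsGame();\n";
        }
        return emit(d + 1);
    }
}
```

Feeding it the tree text above, e.g. `new TreeToJs(Arrays.asList(treeText.split("\n"))).emit(0)`, produces a nested block equivalent to the javascript shown.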